“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. Many of the datasets and examples used in this course will be drawn from real-world datasets and the techniques learned herein aim to be broadly applicable to multiple fields.
This lesson is the first in a 6-part series. By the end of this series, students should be able to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help them identify and convey the key message(s) from their experimental data.
The class runs in a code-along style using R markdown notebooks. At the start of each lecture, a skeleton version of the lecture notebook will be provided on the University of Toronto datatools Hub so students can program along with the instructor.
This week will be your crash-course on R markdown notebooks and R, to refresh packages and principles that will be relevant throughout our course. In our lectures and your assignments we will be working with some uncurated data to simulate the full experience of working with data from start to finish. It’s important that we are all familiar with and understand the majority of the tidy data methods that we’ll be using in class, so that we can focus on the new material as it appears. We’ll use some standard packages and practices to finesse our data before visualizing it, so let’s R-efresh ourselves.
At the end of this lecture we will have covered the following topics, including a refresher on the `tidyverse` package.

These notes use the following formatting conventions:

- `grey background` - a package, function, code, command or directory. Backticks are also used for in-line code.
- *italics* - an important term or concept, or an individual file or folder
- **bold** - a heading or a term that is being defined
- blue text - a named or unnamed hyperlink
- `...` - within a coding cell, this indicates an area of code that students will need to complete for the code cell to run correctly
- Blue box: a key concept that is being introduced
- Yellow box: a risk or caution
- Green box: recommended reads and resources to learn R
- Red box: a comprehension question, which may or may not involve a coding cell; these usually appear at the end of a section
Each week, new lesson files will appear within your RStudio folders.
We are pulling from a GitHub repository using this Repository
git-pull link. Simply click on the link and it will take you to the
University of Toronto datatools
Hub. You will need to use your UTORid credentials to complete the
login process. From there you will find each week’s lecture files in the
directory /2024-03-Adv_Graphics_R/Lecture_XX. You will find
a partially coded skeleton.Rmd file as well as all of the
data files necessary to run the week’s lecture.
Alternatively, you can download the R Markdown notebook (.Rmd) and data files from the RStudio server to your personal computer if you would like to work independently of the University of Toronto datatools Hub.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
At the end of each lecture there will be a completed version of the lecture code released as an HTML file under the Modules section of Quercus.
Today’s datasets will focus on the Ontario public sector salary disclosure, also known as the “Sunshine list”. This list, started in 1996, publishes all public sector employees with an annual salary at or above $100,000. Although not strictly biological data, this is a great dataset to work with because it contains many observations spanning a long time period, with enough data to generate subgroups based on sector, employer, and role!
You can find more information about this dataset on the Ontario public sector salary disclosure webpage
This is a version of the Sunshine list ranging from 1996-2023. It has been lightly trimmed to reduce its size, so the years covered are roughly every 5 years. It has further been anonymized by replacing all names with random numeric identifiers. It is in a tab-separated format and contains nearly 500,000 observations.
This dataset is a table of the monthly inflation rate starting in January 1914, calculated as a static rate of increase. With the proper analysis, it can be used to compare the consumer price index across various timespans. This data was obtained from the Bank of Canada
- `tidyverse`, which has a number of packages including `dplyr`, `tidyr`, `stringr`, `forcats`, and `ggplot2`
- `magrittr`, which will allow us to use a number of different piping/redirect options
- `viridis`, which helps to create colour-blind-friendly palettes for our data visualizations
Let’s run our first code cell!
# Packages to help tidy our data
library(tidyverse)
library(magrittr)
# Packages for the graphical analysis section
library(viridis)
All work with your R markdown notebook on the University of Toronto datatools Hub will be contained within a new browser tab, with the address bar showing something similar to
https://r.datatools.utoronto.ca/user/calvin.mok@utoronto.ca/rstudio/
All of this is running remotely on a University of Toronto server rather than your own machine.
You’ll see a directory structure from your home folder:
i.e. /home/rstudio/2024-03-Adv_Graphics_R/, with a folder named Lecture_01_R_Introduction within it. Clicking on that, you’ll find Lecture_01.R-efresher.skeleton.Rmd, which is the notebook we will use for today’s code-along lecture.
We’ve implemented the class this way to reduce the burden of having to install various programs. While installation can be a little tricky, it’s really not that bad. For this course, however, you don’t need to go through all of that just to improve on your data visualization skills.
R markdown notebooks also give us the option of inserting “markdown” text much like what you’re reading at this very exact moment. So we can intersperse ideas and information between our learning code blocks.
There is, however an appendix section at the end of this lecture detailing how to install the R-kernel itself and the integrated development environment (IDE) called RStudio.
So… what is in these packages? A package can be a collection of:

- functions
- data objects
- compiled code
- functions that override base functions in R
Functions are the basic workhorses of R; they are the tools we use to analyze our data. Each function can be thought of as a unit that has a specific task. A function takes an input, evaluates it using an expression (e.g. a calculation, plot, merge, etc.), and returns an output (a single value, multiple values, a graphic, etc.).
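As a quick sketch of that input-expression-output idea, here is a small, hypothetical function (the name and the conversion task are illustrative, not from the course material):

```r
# A minimal user-defined function: takes an input, evaluates an
# expression, and returns an output
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

fahrenheit_to_celsius(212)
## [1] 100
```

Calling the function with a value (the input) evaluates the arithmetic expression and returns the result (the output).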
In this course we will frequently rely on a package called the
tidyverse which is also composed of a series of other
packages we can use to reformat our data like readr,
dplyr, tidyr and stringr.
Behind the scenes of each markdown notebook the R kernel is running. As we move from code cell to new code cell, all of the variables or objects we have created are stored within memory. We can refer to these as we run the code and move forward, but if you overwrite or change them by mistake, you may have to rerun multiple cell blocks!
There are some options in the “Code” menu that can alleviate these problems such as “Run Region > Run All Chunks Above”. If you think you’ve made a big error by overwriting a key object, you can use that option to “re-initialize” all of your previous code!
Unfortunately, the run order of your code is not tracked. While a code cell is actively running, you will see a STOP sign icon in the top-right corner of the Console window (lower pane). Clicking on this will interrupt the kernel and stop code execution, although depending on the complexity of the code it may take a moment.
Remember these friendly keys/shortcuts:

- Arrow keys to navigate up and down (and within a cell)
- Ctrl+Shift+Enter to run a cell (both code and markdown)
- Ctrl+Enter to run a single line of code within a code cell
- Alt+Ctrl+Enter to run the next cell
- Ctrl+Shift+C to quickly comment and uncomment single or multiple lines of code
- Tab can be used while coding to autocomplete variable, function and file names, and even to look at a list of possible parameters for functions
- Ctrl+Alt+I to insert a new coding cell

Depending on your needs, you may find yourself doing the following:
Markdown allows you to alternate between “markdown” notes and “code” that can be run or re-run on the fly.
Each data run and its results can be saved individually as a new notebook to compare data and small changes to analyses!
Markdown is a markup language that lets you combine HTML and JavaScript code with other languages. This allows you to make HTML, PDF, and text documents that mix text and code, enhancing reproducibility, a key aspect of scientific work. Having everything in a single place also boosts productivity during results interpretation - no need to go back and forth between tabs, pages, and documents. They can all be integrated in a single document, allowing for a more fluid narrative of the story that you are communicating to your audience (fewer distractions for you!). For example, the lines of code below and the text you are reading right now were created in R’s Markdown language. (Do not worry about the R code just yet. We will get there sooner than you think.)
As mentioned, markdown also allows you to write in LaTeX, a document preparation system for mathematical notation. All it takes is to wrap LaTeX code between single dollar signs ($) for inline notation, or double dollar signs ($$), one pair at the beginning of the equation and one at the end. For example, the equation Y_i = beta_0 + beta_1 x_i + epsilon_i, i = 1, …, N can be written in LaTeX as ***Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1, \dots, N***. Now, if we use $$ before and after the LaTeX code, this is what we get:
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1, \dots,N \]
See? Just like that! Here is an example of a table made in Markdown, showing some of the most popular R libraries for data science:
| Library | Use |
|---|---|
| tidyverse | Simplified tabular-data processing functions |
| ggplot2 | Data visualization package typically included in the tidyverse |
| shiny | Used to create interactive R-based web pages and interfaces |
| car | Popular statistical analysis with Type II and III ANOVA tables |
These are just a few examples of what you can do with R Markdown. To find out more on how to get the best of Markdown, head on over to the [R Markdown cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/).
Once you are finished writing your code and interpreting those results in a markdown notebook, you can render the notebook into pdf, html, and many other formats. There are several ways to achieve this. The easiest option is to go to File > Knit Document. Afterwards there should be an option to view in browser at which point you can save as an HTML or print it to PDF.
Let’s discuss some important behaviours before we begin coding:

- Code annotation (commenting) with the # symbol
- Variable naming conventions
- Best practices

Why bother?
“Can you rerun this analysis but change X parameter?” - Anonymous PI
“Can you make this plot, but with dashed lines, a different axis, with error bars?” - Anonymous labmate
“Can I borrow your code?” - Anonymous collaborator or officemate or PI
“Why is that object being sent to that function? What is it returning?” - You, Me, and anyone reading your code
Your worst collaborator is potentially you in 6 days or 6 months. Do you remember what you had for breakfast last Tuesday?
Credit: https://www.testbytes.net/blog/programming-memes/
You can annotate your code for selfish reasons, or altruistic reasons, but annotate your code.
How do I start?
It is, in general, part of best coding practices to keep things tidy and organized.
A hash-tag # will comment your text. Inside a code
cell in an R notebook or anywhere in an R script, all text
after a hashtag will be ignored by R and by many other
programming languages. It’s very useful to add comments about changes in
your code, as well as detailed explanations about your scripts.
Put a description of what you are doing near your code at every
process, decision point, or non-default argument in a function. For
example, why you selected k=6 for an analysis, or the
Spearman over Pearson option for your correlation matrix, or quantile
over median normalization, or why you made the decision to filter out
certain samples.
Break your code into sections to make it readable. Scripts are just a series of steps and major steps should be titled/outlined with your reasoning - much like when presenting your research.
Give your objects informative object names that are not the same as function names.
Comments may/should appear in three places:
# Example commenting section
# At the beginning of the script, describing the purpose of your script and what you are trying to solve
bedmasAnswer <- 5 + 4 * 6 - 0 # In line: describing a part of your code whose purpose is not obvious
#---------- Section dividers helps organize code structure ----------#
## Feel free to add extra hash tags to visually separate or emphasize comments
Maintaining well-documented code is also good for mental health!
Stylistically, you have several common options, including snake_case, camelCase, and period.case.
The most important aspects of naming conventions are being concise and consistent! Throughout this course you’ll see a hybrid system that uses the underscore to separate words, with a period right before a suffix denoting the object type, i.e. this_data.object.
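To make the convention concrete, here is a sketch using hypothetical object names (the data values are invented for illustration):

```r
# Hybrid naming convention: underscores separate words, a period
# precedes a suffix describing the object type
salary_1996.df <- data.frame(employee = c("1001", "1002"),
                             salary   = c(110000, 125000))

# A vector object gets a .vector suffix
top_earners.vector <- salary_1996.df$employee[salary_1996.df$salary >= 120000]
top_earners.vector
## [1] "1002"
```

Both names tell you at a glance what the object contains and what structure it is.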
Start each script with a description of what it does.
Then load all required packages.
Consider what working directory you are in when sourcing a script.
Use comments to mark off sections of code.
Put function definitions at the top of your file, or in a separate file if there are many.
Name and style code consistently.
Break code into small, discrete pieces.
Factor out common operations rather than repeating them.
Keep all of the source files for a project in one directory and use relative paths to access them.
Keep track of the memory used by your program.
Always start with a clean environment instead of saving the workspace.
Keep track of session information in your project folder.
Have someone else review your code.
Use version control.
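As a sketch, the top of a script that follows several of these practices might look like this (the script purpose, function, and values are hypothetical):

```r
# ------------------------------------------------------------------
# Purpose: summarize a small set of salaries from the Sunshine list
# ------------------------------------------------------------------

# Function definitions near the top of the file
mean_salary <- function(x) mean(x, na.rm = TRUE)

# ---- Analysis section ----
salaries.vector <- c(110000, 125000, NA)  # NA mimics a missing value
mean_salary(salaries.vector)
## [1] 117500
```

Note the description at the top, the function defined before it is used, and a commented section divider separating the major steps.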
For more information on best coding practices, please visit swcarpentry
We all run into problems. We’ll see a lot of mistakes happen in class too! That’s OK if we can learn from our errors and quickly (or eventually) recover.
Usually when R generates an error it will produce some information about what has happened. This usually includes an error message detailing the kind of error it encountered or an error message generated by the function. It can also include a line where the error was encountered, or the name of the last function that was called before the error was encountered.
file does not exist: Use getwd() to check where you are working, type list.files() or use the Files pane to check that your file exists there, and setwd() to change your directory if necessary. Preferably, work inside an R project with all project-related files in that same folder. Your working directory will be set automatically when you open the project (a project can be created via File -> New Project and following the prompts).
typos: R is case sensitive so always check that you’ve spelled everything right. Get used to using the tab autocomplete feature when possible. This can reduce typos and increase your overall programming speed.
open quotes, parentheses, brackets: Check that every opening quote, parenthesis, and bracket has a matching closing one. If the console shows a + prompt, R is waiting for you to finish an expression; press Esc to cancel it.
data type: Use commands like typeof() and class() to check what type of data you have. Use str() to peek at your data structures if you’re making assumptions about them.
unexpected answers: To access the help menu,
type help("function"), ?function (using the
name of the function that you want to check), or
help(package = "package_name").
function not found: Make sure the package name is
properly spelled, installed, AND loaded. Libraries can be loaded to the
environment using the function library("package_name"). If
you only need one function from a package, or need to specify to what
package a function belongs because there are functions with the same
name that belong to different packages, you can use a double colon,
i.e. package_name::function_name.
the R bomb!!: The “session aborted” error can happen for a variety of reasons, like not having enough computational power to perform a task, or because of a system-wide failure. Use Session -> Restart R. You will need to rerun your previous cells!

cheatsheets: Meet your new best friends: cheatsheets!
99% of the time, someone has already asked your question
Google, Stack overflow, R Bloggers, SEQanswers, Quora,
ResearchGate, RSeek, twitter, even reddit
Including the program, version, error, package and function helps, so be specific. Sometimes it is useful to include your operating system and version (Windows 10, Ubuntu 18, Mac OS 10, etc.).
You may run into assignment questions where the tools I’ve provided in lecture are not enough to reproduce the example output exactly as provided. If you wish to go that extra mile you may need to look for answers elsewhere by consulting references from the class or searching for it yourself. The truth is out there!
Remember: Everyone looks for help online ALL THE TIME. It is very common. Also, with programming there are multiple ways to come up with an answer, even different packages that let you do the same thing in different ways. You will work on refining these aspects of your code as you go along in this course and in your coding career.
Last but not least, to make life easier: Under the Help
pane, there is a list of cheatsheets related to RStudio, the tidyverse
and other useful packages.
There are many tips and tricks to remember about R but here we’ll quickly recall some fundamental knowledge that could be relevant in later lectures.
If we want to hold on to a number, calculation, or object we need to assign it to a named variable. R has multiple methods for assigning a value to a variable and an order of precedence!
- `->` and `->>` Rightward assignment: we won’t really be using this in our course.
- `<-` and `<<-` Leftward assignment: the assignment used by most ‘authentic’ R programmers, though really just a historical keyboard throwback.
- `=` Leftward assignment: a commonly used token for assignment in many other programming languages, but it holds dual meaning in R (assignment at the top level, argument matching inside function calls)!
In R, the assignment of a variable does not produce any standard output. R processes each new line as a separate command unless you use a semicolon (;) to separate commands on one line. This applies to assignment as well. One exception is when a function call is spread across lines and contained within its ().
R calculates the right side of the assignment first; the result is then applied to the left.
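These evaluation rules can be sketched as follows (values are arbitrary):

```r
radius <- 5              # assignment alone prints nothing
area <- pi * radius^2    # the right side is evaluated first, then assigned
area                     # typing the name prints the stored value
## [1] 78.53982

a <- 1; b <- 2           # a semicolon separates two commands on one line
total <- sum(a, b,
             radius)     # a call may continue across lines inside its ()
total
## [1] 8
```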
Data types are used to classify the basic spectrum of values that are used in R. Here’s a table describing some of the common data types we’ll encounter.
| Data type | Description | Example |
|---|---|---|
| character | Can be single or multiple characters (strings) of letters and symbols. Assigned using single ' or double " quotes | a#c&E |
| integer | Whole number values, either positive or negative | 1 |
| double | Any number that is not an integer | 7.5 |
| logical | Also known as a boolean, representing the state of a conditional (question) | TRUE or FALSE |
| factor | Used as a way to make categorical values. Often used as a finite set of values that appear to be string-based in nature except they can be given a user-specified order. | Yes/No or Low/Medium/High |
| NA | Represents the value of “Not Available” usually seen when imported data has missing values | NA |
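We can query these types directly with typeof() and class(), using values similar to those in the table above:

```r
typeof("a#c&E")                   # a string of letters and symbols
## [1] "character"
typeof(1L)                        # the L suffix creates an integer literal
## [1] "integer"
typeof(7.5)
## [1] "double"
typeof(TRUE)
## [1] "logical"
class(factor(c("Low", "High")))   # factors are a class built on integers
## [1] "factor"
typeof(NA)                        # NA is a logical constant by default
## [1] "logical"
```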
The job of data structures is to “host” the different data types. There are five basic types of data structures that we’ll use in R:
| Data structure | Dimensions | Restrictions |
|---|---|---|
| vector | 1D | Holds a single data type |
| matrix | 2D | Holds a single data type |
| array | nD | Holds a single data type |
| data frame | 2D | Holds multiple data types with some restrictions |
| list | 1D (technically) | Holds multiple data types AND structures |
Sometimes it is helpful to imagine Data Structures as real-world objects to understand how they are shaped and related to each other.
Also known as atomic vectors, each element within a vector must be of the same data type: logical, integer, double, character, complex, or raw.
For each vector there are two key properties that can be queried
with typeof() and length().
There is a numerical order to a vector, much like a queue AND you
can access each element (piece of data) individually or in groups.
Elements are ordered from 1 to
length(your_vector) and can be accessed with an indexing
operator []
Elements of a vector may be named, to facilitate subsetting by character vectors.
Elements of a vector may be subset by a logical vector.
# Build a character vector
char.vector <- c("Canada", "United States", "Great Britain")
char.vector
## [1] "Canada" "United States" "Great Britain"
# subset by a single value
char.vector[2]
## [1] "United States"
# subset by multiple values
char.vector[2:3]
## [1] "United States" "Great Britain"
# subset by removing values (cannot be mixed with positive values)
char.vector[c(-1, -3)]
## [1] "United States"
# subset with repeating multiple values
char.vector[c(1, 2, 3, 3, 2, 1)]
## [1] "Canada" "United States" "Great Britain" "Great Britain"
## [5] "United States" "Canada"
# Build a named character vector by including variable names
character.vector <- c(a = "Canada", b = "United States", c = "Great Britain")
character.vector
## a b c
## "Canada" "United States" "Great Britain"
# subset by element name
character.vector[c("a", "b")]
## a b
## "Canada" "United States"
# subset by an explicit vector of logicals
character.vector[c(FALSE, TRUE, TRUE)]
## b c
## "United States" "Great Britain"
# Or subset by an implicit vector of logicals
character.vector[character.vector != "Canada"]
## b c
## "United States" "Great Britain"
R will implicitly force (coerce) your vector to be of one data type. In this case, the type that is most inclusive is a character vector. When we explicitly coerce a change from one data type to the next, it is known as casting. You can cast between certain data types and also object types.
Type-casting examples: as.logical(),
as.integer(), as.double(),
as.numeric(), as.character(), and
as.factor()
Structure casting examples: as.data.frame(),
as.list(), and as.matrix()
Importantly, when coercing, the R kernel converts from more specific to general types usually in this order:
logical \(\rightarrow\) integer \(\rightarrow\) numeric \(\rightarrow\) complex \(\rightarrow\) character \(\rightarrow\) list.
# Make a logical vector and display its structure
logical.vector <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
str(logical.vector)
## logi [1:5] TRUE FALSE TRUE FALSE FALSE
# Make a numeric vector and display its structure
numeric.vector <- c(-1:10)
str(numeric.vector)
## int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
# Make a mixed vector and display its structure. Take note of its type afterwards
mixed.vector <- c(FALSE, TRUE, 1, 2, "three", 4, 5, "six")
str(mixed.vector)
## chr [1:8] "FALSE" "TRUE" "1" "2" "three" "4" "5" "six"
# Attempt to coerce our vectors
# logical to numeric
as.numeric(logical.vector)
## [1] 1 0 1 0 0
# numeric to logical
as.logical(numeric.vector)
## [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# numeric to character
as.character(numeric.vector)
## [1] "-1" "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
# mixed to a numeric. Note what happens when elements cannot be converted
as.numeric(mixed.vector)
## Warning: NAs introduced by coercion
## [1] NA NA 1 2 NA 4 5 NA
Now that we have had the opportunity to create a few different vector objects, let’s talk about what an object class is. An object class can be thought of as a structure with attributes that will behave a certain way when passed to a function. Because of this, the same function can behave differently depending on the class of the object it receives.
Some R package developers have created their own object classes. For
example, many of the functions in the tidyverse generate
tibble objects. They behave in most ways like a
data.frame but have a more refined print structure, making
it easier to see information such as column types when viewing them
quickly. In general, from a trouble-shooting standpoint, it is good to
be aware that your data may need to be formatted to fit a
certain class of object when using different packages.
After we are done tidying most of our datasets, they will be in tibble objects, but all of the basic data frame functions apply to these as well.
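A minimal sketch of the difference (this assumes the `tibble` package, part of the tidyverse, is installed; the data are invented):

```r
library(tibble)

df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)    # same data, additional tibble classes

class(df)               # "data.frame"
class(tbl)              # "tbl_df" "tbl" "data.frame"
tbl$x                   # standard data frame accessors still work
## [1] 1 2 3
```

Because a tibble still inherits from data.frame, everything we do with data frames below applies to tibbles too.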
While matrices are 2-dimensional structures limited to a single specific type of data within each instance, data frames treat each column of the structure like a vector. The data frame, however, can have multiple data types mixed across each different column. Data frame rules to remember are:
Data frames allow us to generate tables of mixed information, much like an Excel spreadsheet.
# Generate a data frame with different variable/column types
mixed.df <- data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwalth = logical.vector[1:3])
# View the data frame
mixed.df
## country values commonwalth
## a Canada 0 TRUE
## b United States 1 FALSE
## c Great Britain 2 TRUE
# Check the structure of the data frame
str(mixed.df)
## 'data.frame': 3 obs. of 3 variables:
## $ country : chr "Canada" "United States" "Great Britain"
## $ values : int 0 1 2
## $ commonwalth: logi TRUE FALSE TRUE
- `nrow(data_frame)` retrieves the number of rows in a data frame.
- `ncol(data_frame)` retrieves the number of columns in a data frame.
- `data_frame$column_name` accesses a specific column by its name.
- `data_frame[x, y]` accesses a specific element located at row x, column y.
- `rownames(data_frame)` retrieves or assigns row names to your data frame.
- `colnames(data_frame)` retrieves or assigns column names to your data frame.
There are many more ways to access and manipulate data frames that we’ll explore further down the road. Let’s review some basic data frame code.
# query the dimensions of the data frame
dim(mixed.df)
## [1] 3 3
nrow(mixed.df)
## [1] 3
ncol(mixed.df)
## [1] 3
# retrieve row and column names
rownames(mixed.df)
## [1] "a" "b" "c"
colnames(mixed.df)
## [1] "country" "values" "commonwalth"
# print the mixed data frame
mixed.df
## country values commonwalth
## a Canada 0 TRUE
## b United States 1 FALSE
## c Great Britain 2 TRUE
# Access portions of the data frame
# a single column
str(mixed.df$country)
## chr [1:3] "Canada" "United States" "Great Britain"
# a single element
mixed.df[2, 3] # Use index position
## [1] FALSE
mixed.df[3, "country"] # Mix index position and column names
## [1] "Great Britain"
# multiple rows
mixed.df[c(1,3),] # Use vectors to select groups of rows/columns
## country values commonwalth
## a Canada 0 TRUE
## c Great Britain 2 TRUE
mixed.df[-2, ] # Use negative values to EXCLUDE rows/columns
## country values commonwalth
## a Canada 0 TRUE
## c Great Britain 2 TRUE
Lists can hold mixed data types of different lengths. These are especially useful for bundling data of different types to pass around your scripts and functions, or when receiving output from functions! Rather than having to call multiple variables by name, you can store them in a single list!
If you forget the contents of your list, use the str()
function to check out its structure. str() will tell you
the number of items in your list and their data types.
# Make a named list of various items
mixed.list <- list(countries = character.vector, values = numeric.vector, mixed.data = mixed.df)
# Look at some information about our list
str(mixed.list)
## List of 3
## $ countries : Named chr [1:3] "Canada" "United States" "Great Britain"
## ..- attr(*, "names")= chr [1:3] "a" "b" "c"
## $ values : int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
## $ mixed.data:'data.frame': 3 obs. of 3 variables:
## ..$ country : chr [1:3] "Canada" "United States" "Great Britain"
## ..$ values : int [1:3] 0 1 2
## ..$ commonwalth: logi [1:3] TRUE FALSE TRUE
# What are the names of the elements in mixed.list
names(mixed.list)
## [1] "countries" "values" "mixed.data"
Note the $ sign on the left-hand side of the str() output. What follows is the name of our list element, followed by a : and a description of that element.
# Lists can often be unnamed
unnamed.list <- list(character.vector, numeric.vector, mixed.df)
# Look at some information about our unnamed list
str(unnamed.list)
## List of 3
## $ : Named chr [1:3] "Canada" "United States" "Great Britain"
## ..- attr(*, "names")= chr [1:3] "a" "b" "c"
## $ : int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
## $ :'data.frame': 3 obs. of 3 variables:
## ..$ country : chr [1:3] "Canada" "United States" "Great Britain"
## ..$ values : int [1:3] 0 1 2
## ..$ commonwalth: logi [1:3] TRUE FALSE TRUE
names(unnamed.list)
## NULL
Accessing lists is much like opening up a box of boxes of chocolates. You never know what you’re gonna get when you forget the structure!
You can access elements with a mixture of number and naming annotations, much like data frames. Note that unnamed lists cannot be accessed with naming annotations.

- `[x]` returns a list object containing your element(s) of choice.
- `[[x]]` returns a “single” element only, the xth element of the list, but that element could be a vector, data frame, list, etc.

# Subset our list with []
mixed.list[c(1,3,2)]
## $countries
## a b c
## "Canada" "United States" "Great Britain"
##
## $mixed.data
## country values commonwalth
## a Canada 0 TRUE
## b United States 1 FALSE
## c Great Britain 2 TRUE
##
## $values
## [1] -1 0 1 2 3 4 5 6 7 8 9 10
str(mixed.list["values"])
## List of 1
## $ values: int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
# Pull out a single element
mixed.list[[2]]
## [1] -1 0 1 2 3 4 5 6 7 8 9 10
mixed.list[["countries"]]
## a b c
## "Canada" "United States" "Great Britain"
# Give a vector as input to [[]]
mixed.list[[c(1,3)]]
## [1] "Great Britain"
# vs equivalent
mixed.list[[1]][3]
## c
## "Great Britain"
# Access a single element from a data frame nested in a list
mixed.list[[c(3,1,1)]]
## [1] "Canada"
# vs equivalent
mixed.list[[3]][1,1]
## [1] "Canada"
Comprehension Question 2.2.4.1: Suppose we had a list named multiDF.list consisting of 3 data frames, as shown in the following code cell. How would you subset the 2nd and 3rd data frames into their own list? How would you access the “values” column from the 3rd data frame? Use the following code cell to help you out.
# Comprehension answer code 2.2.4.1
multiDF.list = list(mixed.df, rbind(mixed.df, mixed.df), rbind(mixed.df, mixed.df, mixed.df))
str(multiDF.list)
# Subset the 2nd and 3rd dataframes as their own list
...
# Output the "values" column of the 3rd dataframe
...
Ah, the dreaded factors! A factor is a class of object used to encode a character vector into categories. They are used to store categorical variables, and although it is tempting to think of them as character vectors, this is a dangerous mistake. Adding or changing data in a data frame with pre-existing factors requires that you match factor levels correctly as well.
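To see the danger in action (hypothetical data):

```r
# Assigning a value that is not an existing level produces NA, not an error
provinces <- factor(c("ON", "QC", "ON"))
levels(provinces)
## [1] "ON" "QC"

provinces[1] <- "BC"    # "BC" is not among the levels!
## Warning: invalid factor level, NA generated

provinces
## [1] <NA> QC   ON
## Levels: ON QC
```

R silently replaces the value with NA and only emits a warning, so this mistake is easy to miss in a long script.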
Factors make perfect sense if you are a statistician designing a programming language (!) but to everyone else they exist solely to torment us with confusing errors. At its core, a factor is really just an integer vector with an additional attribute, its levels (queried with levels()), which defines the accepted values for that variable.
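We can peek at both pieces, the integer codes and the levels attribute, directly (example values are invented):

```r
sizes <- factor(c("Low", "High", "Medium", "Low"),
                levels = c("Low", "Medium", "High"))

levels(sizes)       # the accepted values, in their user-specified order
as.integer(sizes)   # the underlying integer codes
## [1] 1 3 2 1
```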
Why not just use character vectors, you ask?
Believe it or not factors do have some useful properties. For example, factors allow you to specify all possible values a variable may take even if those values are not in your data set. Think of conditional formatting in Excel. We also use them heavily in generating statistical analyses and in grouping data when we want to visualize it.
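For example, a factor can carry a level that never appears in the data, and table() will still count it (hypothetical survey data):

```r
# Levels can include values that were never observed
responses <- factor(c("Yes", "Yes", "No"),
                    levels = c("Yes", "No", "Maybe"))
table(responses)    # "Maybe" appears with a count of 0
## responses
##   Yes    No Maybe
##     2     1     0
```

A plain character vector would silently drop the “Maybe” category from the summary.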
Since the inception of R, `data.frame()` calls have been used to create data frames, but the default behaviour was to convert character (string) columns to factors! This is a throwback to the original purpose of R, which was to perform statistical analyses on datasets with methods like ANOVA that examine the relationships between variables (i.e. factors)!
As R has become more popular and its applications and packages have expanded, incoming users have been faced with remembering this obscure behaviour, leading to lost hours of debugging grief as they wonder why they can’t pull information from their dataframes to do a simple analysis on C. elegans strain abundance via molecular inversion probes in datasets of multiplexed populations. #SuspiciouslySpecific
That meant that users usually had to create data frames including the toggle
data.frame(name=character(), value=numeric(), stringsAsFactors = FALSE)
Fret no more! As of R 4.0.0 the default behaviour has switched and `stringsAsFactors = FALSE` is the default! Now if we want our character columns to be factors, we must convert them explicitly, or turn this behaviour on at the outset of creating each data frame!
# Generate a data frame and include factors for all character-based content
str(data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
stringsAsFactors = TRUE)
)
## 'data.frame': 3 obs. of 4 variables:
## $ country : Factor w/ 3 levels "Canada","Great Britain",..: 1 3 2
## $ values : int 0 1 2
## $ commonwealth: logi TRUE FALSE TRUE
## $ continent : Factor w/ 2 levels "Europe","North America": 2 2 1
# Explicitly define factors for specific variables.
str(data.frame(country = factor(character.vector),
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
stringsAsFactors = FALSE)
)
## 'data.frame': 3 obs. of 4 variables:
## $ country : Factor w/ 3 levels "Canada","Great Britain",..: 1 3 2
## $ values : int 0 1 2
## $ commonwealth: logi TRUE FALSE TRUE
## $ continent : chr "North America" "North America" "Europe"
**Define factors and their levels explicitly during or after data.frame creation**

From above, you can specify which columns of strings are converted to factors at the time of declaring your column information. Alternatively, you can coerce character vectors to factors after generating them.
R’s default behaviour puts factor levels in alphabetical
order. This can cause problems if we aren’t aware of it. You can
check the order of your factor levels with the levels()
command. Furthermore you can specify, during factor creation, your level
order.
Always check to make sure your factor levels are what you expect.
With factors, we can deal with our character levels directly, or their numeric equivalents.
# Generate a data frame and include factors
str(data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = factor(c("North America", "North America", "Europe"),
levels = c("North America", "Europe"))
)
)
## 'data.frame': 3 obs. of 4 variables:
## $ country : chr "Canada" "United States" "Great Britain"
## $ values : int 0 1 2
## $ commonwealth: logi TRUE FALSE TRUE
## $ continent : Factor w/ 2 levels "North America",..: 1 1 2
# Coerce a factor after the fact
# Build a data frame
mixed.df <- data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"))
# Set our factor after declaring the data frame
mixed.df$continent <- factor(mixed.df$continent, levels=c("North America", "Europe"))
str(mixed.df)
## 'data.frame': 3 obs. of 4 variables:
## $ country : chr "Canada" "United States" "Great Britain"
## $ values : int 0 1 2
## $ commonwealth: logi TRUE FALSE TRUE
## $ continent : Factor w/ 2 levels "North America",..: 1 1 2
- Use `levels()` to list the levels and their order for your factor.
- To rename the levels of a factor, declare and reassign your factor.
- Move a single level to the first position within your factor levels with `relevel()`.
- Factor levels can be assigned an order of precedence during their creation with the parameter `ordered = TRUE`.
- Define labels for your factor during creation with the parameter `labels = c()`. Note that level order is assigned before labels are added to your data. You are essentially labeling the integers assigned to your factor levels, so be careful when using this parameter!
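A quick sketch of these ideas, using invented `sizes` and `grp` data (not from our course dataset):

```r
# Specify level order and precedence at creation time
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
sizes < "large"   # ordered factors support comparisons: TRUE FALSE TRUE

# relevel() moves one level to the front of an (unordered) factor
grp <- factor(c("a", "b", "c"))
levels(relevel(grp, ref = "c"))  # "c" "a" "b"

# labels = c() renames the levels (applied after level order is set)
factor(c(1, 2, 1), levels = c(1, 2), labels = c("control", "treated"))
```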
**Advanced factor functions with `forcats`.** If you're looking for more advanced functions to manipulate, sort, or update factors, check out the `forcats` package. With it, you can refactor based on functions, frequency, or explicitly re-specify the order of one or more factor levels. We'll see this package in action in more detail during later lectures.
Yes, you can treat data frames and arrays like large lists where mathematical operations can be applied to individual elements or to entire columns or more!
**Logical values (`TRUE`/`FALSE`): coercion to numeric before applying operations**

Therefore, be careful to specify numeric data for mathematical operations.
mixed.df
## country values commonwealth continent
## a Canada 0 TRUE North America
## b United States 1 FALSE North America
## c Great Britain 2 TRUE Europe
# Add to each element
mixed.df$values + 3
## [1] 3 4 5
# Add columns to each other
mixed.df$values + mixed.df$values
## [1] 0 2 4
# multiply each element by a constant
mixed.df$values * 4
## [1] 0 4 8
# implicit coercion of logical to integer
mixed.df$commonwealth * 5
## [1] 5 0 5
# Perform math on a factor
mixed.df$continent * 6
## Warning in Ops.factor(mixed.df$continent, 6): '*' not meaningful for factors
## [1] NA NA NA
# Convert the factor to a numeric first
as.numeric(mixed.df$continent) * 7
## [1] 7 7 14
# Can we perform math on non-numeric variables?
mixed.df$country + 8
## Error in mixed.df$country + 8: non-numeric argument to binary operator
**Use the `apply()` family of functions to perform actions across data structures**

The above are illustrative examples of how our different data structures behave. In reality, you will want to do calculations across rows and columns, not on your entire matrix or data frame.
**The `apply()` function will recognize basic functions and use them on vectorized data**

For example, we might have a count table where rows are genes, columns are samples, and we want to know the sum of all the counts for a gene. To do this, we can use the `apply()` function.
`apply()` takes an array or matrix (or something that can be coerced to one, like a numeric data frame) and applies a function over rows or columns. The `apply()` function takes the following parameters:
- `X`: an array, matrix, or something that can be coerced to these objects.
- `MARGIN`: defines how to apply the function; 1 = rows, 2 = columns.
- `FUN`: the function to be applied, supplied as a function name without the `()` suffix.
- `...`: this notation means we can pass additional parameters to the function defined by `FUN`.

`apply()` returns a vector, array, or list depending on the nature of `X`.
Let’s practice by invoking the sum function.
# Make a sample data frame of numeric values only
numeric.df = data.frame(geneA = numeric.vector, geneB = numeric.vector*2, geneC = numeric.vector*3)
# We now have a 12x3 dataframe
numeric.df
## geneA geneB geneC
## 1 -1 -2 -3
## 2 0 0 0
## 3 1 2 3
## 4 2 4 6
## 5 3 6 9
## 6 4 8 12
## 7 5 10 15
## 8 6 12 18
## 9 7 14 21
## 10 8 16 24
## 11 9 18 27
## 12 10 20 30
# Apply sum() to each row
apply(numeric.df, MARGIN = 1, sum)
## [1] -6 0 6 12 18 24 30 36 42 48 54 60
# Apply sum() to each column
apply(numeric.df, 2, sum)
## geneA geneB geneC
## 54 108 162
**Additional members of the `apply()` family**

There are 3 additional members of the `apply()` family that perform similar functions with varying outputs:

- `lapply(data, FUN, ...)` is usable on data frames, lists, and vectors. It returns a list as output; `FUN` is applied to each element, with extra parameters passed through `...`.
- `sapply(data, FUN, ...)` works similarly to `lapply()` except it tries to simplify the output to the most elementary data structure possible, i.e. it returns the simplest form of the data that makes sense as a representation.
- `mapply(FUN, data, ...)` is short for "multivariate" apply; it applies a function element-wise across multiple lists or vector arguments.
# Use lapply on the columns of numeric.df
lapply(numeric.df, sum)
## $geneA
## [1] 54
##
## $geneB
## [1] 108
##
## $geneC
## [1] 162
str(lapply(numeric.df, sum))
## List of 3
## $ geneA: int 54
## $ geneB: num 108
## $ geneC: num 162
# Use sapply on the columns of numeric.df
sapply(numeric.df, sum)
## geneA geneB geneC
## 54 108 162
# We are returned a named vector
str(sapply(numeric.df, sum))
## Named num [1:3] 54 108 162
## - attr(*, "names")= chr [1:3] "geneA" "geneB" "geneC"
# Using lapply and sapply and sum on an actual list
sum.list <- list(numeric.vector, numeric.df)
str(sum.list)
## List of 2
## $ : int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
## $ :'data.frame': 12 obs. of 3 variables:
## ..$ geneA: int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
## ..$ geneB: num [1:12] -2 0 2 4 6 8 10 12 14 16 ...
## ..$ geneC: num [1:12] -3 0 3 6 9 12 15 18 21 24 ...
# lapply on the list returns a list
lapply(sum.list, sum)
## [[1]]
## [1] 54
##
## [[2]]
## [1] 324
# sapply on the list returns a vector
sapply(sum.list, sum)
## [1] 54 324
# Use lapply to select portions from a list
sum.list <- list(numeric.df, numeric.df)
# Extract the first row from each member of the list
print("sum.list first rows:")
## [1] "sum.list first rows:"
lapply(sum.list, "[", 1, )
## [[1]]
## geneA geneB geneC
## 1 -1 -2 -3
##
## [[2]]
## geneA geneB geneC
## 1 -1 -2 -3
# Take a close look at what sapply returns in this case
sapply(sum.list, "[", 1,)
## [,1] [,2]
## geneA -1 -1
## geneB -2 -2
## geneC -3 -3
# Extract the 2nd column from each member of the list
print("sum.list second columns:")
## [1] "sum.list second columns:"
lapply(sum.list, "[", , 2)
## [[1]]
## [1] -2 0 2 4 6 8 10 12 14 16 18 20
##
## [[2]]
## [1] -2 0 2 4 6 8 10 12 14 16 18 20
# Take a close look at what sapply returns in this case
sapply(sum.list, "[", , 2)
## [,1] [,2]
## [1,] -2 -2
## [2,] 0 0
## [3,] 2 2
## [4,] 4 4
## [5,] 6 6
## [6,] 8 8
## [7,] 10 10
## [8,] 12 12
## [9,] 14 14
## [10,] 16 16
## [11,] 18 18
## [12,] 20 20
Notice how in using sapply() to extract from a list of
data frames, a single matrix was returned - a single output in the
simplest form that maintains structure.
Now let’s give mapply() a try.
# Use mapply in an example on numeric.vector
mapply(sum, numeric.vector, numeric.vector)
## [1] -2 0 2 4 6 8 10 12 14 16 18 20
numeric.vector + numeric.vector
## [1] -2 0 2 4 6 8 10 12 14 16 18 20
# Use mapply in an example on numeric.df
mapply(sum, numeric.df, numeric.df)
## geneA geneB geneC
## 108 216 324
# Use mapply on the rep function to see its output
mapply(rep, c("repeat", "this", "phrase"), 4)
## repeat this phrase
## [1,] "repeat" "this" "phrase"
## [2,] "repeat" "this" "phrase"
## [3,] "repeat" "this" "phrase"
## [4,] "repeat" "this" "phrase"
From our observations, `mapply(sum, numeric.vector, numeric.vector)` behaves just like `numeric.vector + numeric.vector`. In each case, it applies `sum` to the first elements (or columns) of each argument, then the second elements, and so on. New sets are formed for each element-wise position before applying the `FUN` argument!
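To make the element-wise pairing concrete, here is a small illustrative example (with a throwaway anonymous function):

```r
# mapply() pairs the 1st, 2nd, 3rd... elements of its arguments
mapply(function(x, y) x * y, 1:3, c(10, 20, 30))
# [1] 10 40 90  -- i.e. 1*10, 2*20, 3*30
```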
**`NA` and `NaN` values**

Missing values in R are represented by `NA` (Not Available). Impossible values (like the result of `0/0`) are represented by `NaN` (Not a Number). Both can be considered null values. These two types of values, especially `NA`s, have special ways of being dealt with; otherwise they may lead to errors in some functions.
For our purposes, we are not interested in keeping NA
data within our datasets so we will usually detect and remove them or
replace them within our data after it is imported.
**Detecting and handling `NA` data**

- `is.na()` returns a logical vector reporting which values from your query are `NA`.
- `complete.cases()` returns a row-matched logical vector with a value of `TRUE` for rows without any `NA` values.
- Many functions can ignore `NA` values with the `na.rm = TRUE` parameter: e.g. `mean()`, `sum()`, etc.
- The `tidyr` package can also be used to work with `NA` values.

# Add some NAs to our data frame
mixed.df <- data.frame(country = character.vector,
values = c(3, NA, 9),
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
measure = c("metric", NA, "metric")
)
# Look at our updated data frame
mixed.df
## country values commonwealth continent measure
## a Canada 3 TRUE North America metric
## b United States NA FALSE North America <NA>
## c Great Britain 9 TRUE Europe metric
# Which entries are NA?
is.na(mixed.df)
## country values commonwealth continent measure
## a FALSE FALSE FALSE FALSE FALSE
## b FALSE TRUE FALSE FALSE TRUE
## c FALSE FALSE FALSE FALSE FALSE
# Which rows are incomplete?
complete.cases(mixed.df)
## [1] TRUE FALSE TRUE
# Use some math functions
sum(mixed.df$values, na.rm=TRUE)
## [1] 12
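As a preview of the `tidyr` helpers mentioned above, here is one possible way to drop or replace the NAs in `mixed.df` (a sketch; the replacement values are arbitrary):

```r
library(tidyr)

# Drop any row containing an NA (keeps rows a and c)
drop_na(mixed.df)

# Or replace NAs column-by-column with placeholder values
replace_na(mixed.df, list(values = 0, measure = "unknown"))
```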
**Tidy data and the `tidyverse`**

Each dataset has its own problems. Image from: https://cfss.uchicago.edu/notes/tidy-data/
Let’s begin with some definitions:
Variable: A part of an experiment that can be controlled, changed, or measured.
Observation: The results of measuring the variables of interest in an experiment.
Wide-format data: variables may be listed in the first column, each forming its own row, with observations presented as columns that hold the observed values for each variable.
Long-format data: each variable is its own column, and the results of each measured variable are recorded in rows.
In data science, long format is preferred over wide format because it allows for an easier and more efficient subsetting and manipulation of the data. To read more about wide and long formats, visit here.
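As a tiny illustration (with made-up gene counts), the same measurements can be laid out both ways:

```r
# Wide format: one row per gene, one column per sample
wide.df <- data.frame(gene    = c("geneA", "geneB"),
                      sample1 = c(5, 2),
                      sample2 = c(7, 3))

# Long format: one row per gene-sample observation
long.df <- data.frame(gene   = rep(c("geneA", "geneB"), each = 2),
                      sample = rep(c("sample1", "sample2"), times = 2),
                      count  = c(5, 7, 2, 3))
```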
Why tidy data?
Data cleaning/wrangling (or dealing with ‘messy’ data) accounts for a huge chunk of a data scientist’s time. Ultimately, we want to get our data into a ‘tidy’ format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure and tools that work with that standardized data structure can help this process along.
In Tidy data:
This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful to unraveling its structure and getting it into a usable format.
Observational units: Of the three rules, the idea of observational units might be the hardest to grasp. As an example, you may be tracking a puppy population across 4 variables: age, height, weight, fur colour. Each observation unit is a puppy. However, you might be tracking the same puppies across multiple measurements - so a time factor applies. In that case, the observation unit now becomes puppy-time. Now each puppy-time measurement belongs in a different table (at least by tidy data standards). This, however, is a simple example and things can get more complex when taking into consideration what defines an observational unit. Check out this blog post by Claus O. Wilke for a little more explanation.
Let’s begin this journey with data import.
**The `readr` package**

"All roads lead to Rome"… but not all roads are easy to travel.
Depending on format, data files can be opened in a number of ways.
The simplest methods we will use involve the readr package
as part of the tidyverse. These functions have already been
developed to simplify the import process for users. The functions we
will use most often are:
Read in a delimited file: read_delim(),
read_csv(), read_tsv(),
read_csv2() [European datasets]
Read in from a file, line by line:
read_lines()
Let’s read in our first dataset so that we can convert from wide to long format.
# Use read_csv to look at our compiled Sunshine list
sunshineWide.df <- read_csv("./data/sunshineList_subset_numID_wide.tsv")
## Rows: 516966 Columns: 8
## -- Column specification ------------------------------------------------------
## Delimiter: ","
## chr (7): 1996, 2001, 2006, 2011, 2016, 2021, 2023
## dbl (1): numericID
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the structure and characteristics of sunshineWide.df
str(sunshineWide.df, give.attr = FALSE)
## spc_tbl_ [516,966 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ numericID: num [1:516966] 19206219 10627148 17402443 18586778 17626134 ...
## $ 1996 : chr [1:516966] "Other Public Sector Employers_Addiction Research Foundation_President & Ceo_$194,890.40_$711.24" "Other Public Sector Employers_Addiction Research Foundation_Dir., Soc. Eval. Research & Act. Dir., Clin. Resear"| __truncated__ "Other Public Sector Employers_Addiction Research Foundation_V.p., Research & Coordinator, Intern. Programs_$149,434.48_$512.58" "Ontario Public Service_Agriculture,Food And Rural Affairs_Deputy Minister_$109,382.92_$4,921.68" ...
## $ 2001 : chr [1:516966] NA NA NA NA ...
## $ 2006 : chr [1:516966] NA NA NA NA ...
## $ 2011 : chr [1:516966] NA NA NA NA ...
## $ 2016 : chr [1:516966] NA NA NA NA ...
## $ 2021 : chr [1:516966] NA NA NA NA ...
## $ 2023 : chr [1:516966] NA NA NA NA ...
head(sunshineWide.df)
## # A tibble: 6 x 8
##   numericID `1996`                     `2001` `2006` `2011` `2016` `2021` `2023`
##       <dbl> <chr>                      <chr>  <chr>  <chr>  <chr>  <chr>  <chr>
## 1  19206219 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA
## 2  10627148 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA
## 3  17402443 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA
## 4  18586778 Ontario Public Service_Ag~ NA     NA     NA     NA     NA     NA
## 5  17626134 Hospitals_Ajax And Picker~ Hospi~ NA     NA     NA     NA     NA
## 6  13808138 Colleges_Algonquin Colleg~ Colle~ Colle~ Colle~ NA     NA     NA
tail(sunshineWide.df)
## # A tibble: 6 x 8
##   numericID `1996` `2001` `2006` `2011` `2016` `2021` `2023`
##       <dbl> <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>
## 1  13469140 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
## 2  19670454 NA     NA     NA     NA     NA     NA     Ontario Power Generation_~
## 3  13131015 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
## 4  18645579 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
## 5  16717888 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
## 6  15417227 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
any(is.na(sunshineWide.df))
## [1] TRUE
From looking at our data, we see there are 8 columns across 516,966 observations: a numeric ID plus 7 sampled years of income data (from the 28-year span, 1996-2023) for 516,966 unique individual IDs. From the outset, we can see some issues with the data set that we'll want to resolve, and we'll work through some `tidyverse` functions in order to do that. First, let's quickly review some of the potential problems with our dataset.
Under each year (column) we see a very interesting collection of
information. The data actually represents many variables and information
that appear to be separated by the underscore (_). These
data actually represent 5 variables: Sector, Employer, Job Title, Salary
Paid, and Taxable Benefits.
There are many NA values present. This is to be expected given that not every individual can possibly have income data across every year covered. Some individuals will enter later or retire during the 28 years of data.
Our sector names and other column values will be quite inconsistent, and we'll want to address these by reformatting them properly.
In the end, we want to convert our data to look something like this:
| numericID <fct> | salary <dbl> | taxableBenefits <dbl> | calendarYear <int> | sector <fct> | employer <chr> | title<chr> |
|---|---|---|---|---|---|---|
| 19206219 | $194,890.40 | $711.24 | 1996 | Other Public Sector Employers | Addiction Research Foundation | President & CEO |
| 10627148 | $115,603.62 | $403.41 | 1996 | Other Public Sector Employers | Addiction Research Foundation | Dir., Soc. Evl. Research & Act. Dir., Clin. Research |
| 17402443 | $149,434.48 | $512.58 | 1996 | Other Public Sector Employers | Addiction Research Foundation | V.p., Research & Coordinator, Intern. Programs |
| … | … | … | … | … | … | … |
Before we tackle these issues, let’s go ahead and review some of the tools at our disposal.
**The `tidyverse` package and its contents make manipulating data easier**

While the tidyverse is composed of multiple packages, we will focus on working with a subset of these: `dplyr`, `tidyr`, and `stringr`.
**Use the pipe `%>%` whenever you can!**

To save on making extra variables in memory and to help make our code more concise, we should make use of the `%>%` symbol. This is a redirection or pipe symbol, similar to `|` in Unix operating systems, used for redirecting output from one function to the input of another. By thoughtfully combining this with other commands, we can alter or query our datasets with ease.
We’ll also introduce the %<>% in this class. This
is a little more advanced but it allows us to assign the final product
of our chain of commands to the very first object.
Whenever we are redirecting, we are implicitly passing our output to
the first parameter of the next function. We may not always want to use
the entirety of the output or we may want to also reuse that redirected
output as part of another parameter. To do so we can use .
to explicitly denote the redirected output.
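A quick sketch of these piping rules (assuming `magrittr` is loaded for `%>%` and `%<>%`):

```r
library(magrittr)

# The piped value becomes the first argument of the next call
c(1, 4, 9) %>% sqrt() %>% sum()   # sum(sqrt(c(1, 4, 9))) = 6

# Use . to place the piped value in a different parameter
10 %>% seq(from = 2, to = .)      # equivalent to seq(from = 2, to = 10)

# %<>% pipes and assigns the result back to the starting object
x <- c(4, 9)
x %<>% sqrt()   # x is now 2 3
```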
**Native piping in R:** Note that as of R 4.1.0 a native pipe symbol `|>` was added to the language, which serves the same function as the `%>%` symbol we are using. We will stick with `%>%`, partly because RStudio's keyboard shortcut `Ctrl + Shift + M` inserts it by default, making it more convenient while coding.
**`dplyr` has functions for accessing and altering your data**

We will often use the "verbs" of the `dplyr` package to massage the look of our data by changing column names or subsetting it. The most common verbs you will see in this course are:
| Function(s) | Description |
|---|---|
| `arrange()` | Arrange rows by column values |
| `count()`, `tally()` | Count observations by group |
| `distinct()` | Subset rows by distinct or unique values |
| `filter()` | Subset rows by column values |
| `mutate()`, `transmute()` | Create, modify, or delete columns |
| `select()` | Subset columns using their names and types |
| `summarize()` or `summarise()` | Summarize groups down to fewer rows |
| `group_by()` vs. `ungroup()` | Group by one or more variables |
| `rowwise()` | Group data as single rows for calculations across each |
| `rename()` and `relocate()` | Rename or move columns |
**`tidyr` has additional functions for reshaping our data**

The `tidyr` package will be most useful when we are trying to reshape our data from the wide to the long format, or vice versa. This is much more useful when we want to drastically alter portions or all of our data.
| Function(s) | Description |
|---|---|
| `pivot_longer()` | Pivot data from wide to long |
| `pivot_wider()` | Pivot data from long to wide |
| `extract()` | Extract a character column into multiple groups |
| `separate()` | Separate a character column into multiple groups |
| `unite()` | Unite multiple columns into one by pasting strings |
| `drop_na()` | Drop rows containing missing values |
| `replace_na()` | Replace NAs with specific values |
**`stringr` provides functionality for searching data based on regular expressions**

The `stringr` package will come in most useful when we are trying to fix string issues with our data. Many times our headers or data will contain spaces or poor formatting. We will often prefer to have our headers in lower-case format, with any spaces replaced by an `_`. We'll also use verbs from this package to make any variables or data more concise.
| Category | Function(s) | Description |
|---|---|---|
| String analysis | `str_count()` | Count the number of matches in a string |
| String retrieval | `str_detect()` | Detect the presence (or absence) of a pattern in a string |
| | `str_extract()` and `str_extract_all()` | Extract matching patterns from a string |
| | `str_match()` and `str_match_all()` | Extract matched groups from a string |
| | `str_subset()` and `str_which()` | Keep or find strings matching a pattern |
| String alteration | `str_remove()` and `str_remove_all()` | Remove matched patterns from a string |
| | `str_split()`, `str_split_fixed()`, and `str_split_n()` | Split a string into pieces |
| | `str_c()` | Concatenate multiple strings into a single string with an optional separator |
| | `str_flatten()` | Flatten a string vector into a single string |
| | `str_sub()` | Extract and replace substrings from a character vector |
| | `str_to_upper()` and `str_to_lower()` | Convert the case of a string |
Time to tackle our dataset!
**Convert our data from wide to long format with `pivot_longer()`**

As you may recall, our Sunshine data is formatted such that each column represents a host of salary information for a given year from 1996-2023. However, to begin working with this data, we want to move it towards a long format, which requires that the data from each calendar year be put into a single column denoting each date. This way, each row will represent a single observation for a unique ID in a single year.
Today we will use the `pivot_longer()` function to convert our wide-format data over to long format. For our purposes, we will rely on four parameters:

- `data`: the data frame (and columns) that we wish to transform.
- `cols`: the columns that we wish to gather/collapse into a long format.
- `names_to`: the variable name of the new column that will hold the collapsed information from our current columns.
- `values_to`: the variable name for the values of each observation that we are collapsing down.

We'll be using a series of `%>%`, so for now we won't save our work to a new object.
# A reminder of what our data looks like
sunshineWide.df %>% head()
## # A tibble: 6 x 8
##   numericID `1996`                     `2001` `2006` `2011` `2016` `2021` `2023`
##       <dbl> <chr>                      <chr>  <chr>  <chr>  <chr>  <chr>  <chr>
## 1  19206219 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA
## 2  10627148 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA
## 3  17402443 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA
## 4  18586778 Ontario Public Service_Ag~ NA     NA     NA     NA     NA     NA
## 5  17626134 Hospitals_Ajax And Picker~ Hospi~ NA     NA     NA     NA     NA
## 6  13808138 Colleges_Algonquin Colleg~ Colle~ Colle~ Colle~ NA     NA     NA
# Start with our wide-format data
sunshineWide.df %>%
# Pivot the data into a long-format set
pivot_longer(cols = ..., names_to = ..., values_to = ...) %>%
# Just take a quick look at the output.
str()
## Error in str(.): '...' used in an incorrect context
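For reference, one way the blanks above could be filled in (matching the column names we use later in this lesson):

```r
# Pivot the 7 year columns (2:8) into calendarYear/combinedData pairs
sunshineWide.df %>%
  pivot_longer(cols = c(2:8),
               names_to = "calendarYear",
               values_to = "combinedData") %>%
  str()
```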
**Remove `NA` observations from our data with `filter()`**

Our conversion to long format creates 3,618,762 observations relating our `numericID` data to `calendarYear`, but many of those observations simply have no value in the `combinedData` variable because none exists. Therefore, we want to remove those observations, which are simply invalid.
One way we could have removed those non-existent values was in our
pivot_longer() call with the
values_drop_na = TRUE parameter. However, for the sake of
practice we’ll work with the filter() verb since we’ll be
using that a lot more throughout our workflows.
The `filter()` function takes the following parameters:

- `.data`: the data set in question. When working with the `%>%` operator, this is implicitly assigned by the output of the previous function.
- `...`: the series of predicates/conditions that you want to filter with. This can be a single conditional statement, or multiple statements in a comma-separated format.

In our case, we are going to use the `is.na()` function to help us determine which rows to keep. We'll save the result of this initial transformation to a new variable, `sunshineLong.df`.
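As a warm-up, here is how `filter()` and `is.na()` combine on a toy data frame (invented data, not our Sunshine set):

```r
library(dplyr)

toy.df <- data.frame(id = 1:4, value = c(10, NA, 30, NA))

# Keep only the rows where value is not NA (rows 1 and 3)
toy.df %>% filter(!is.na(value))
```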
# Start with our wide-format data
# We'll save our pivoted data into a new variable to save some time in the future
sunshineLong.df <-
sunshineWide.df %>%
# Pivot the data into a long-format set
pivot_longer(cols = c(2:8), names_to = "calendarYear", values_to = "combinedData") %>%
# filter out our NA rows
filter(...)
## Error in sunshineWide.df %>% pivot_longer(cols = c(2:8), names_to = "calendarYear", : '...' used in an incorrect context
# Check out our resulting data
str(sunshineLong.df)
## Error in str(sunshineLong.df): object 'sunshineLong.df' not found
By filtering our NA values, we reduced our number of observations by about 2.9M entries!
**Break columns apart with the `separate()` function**

Looking at our current wrangling output, we see that we still have to deal with the `combinedData` column, which holds all of that juicy underscore-separated information. We'll use the `separate()` function to help us break this column apart into 5 new columns.
For the separate() function we will use the following
parameters:
- `.data`: the data frame we will use for our function.
- `col`: the specific column we wish to break apart into new columns.
- `into`: a vector of names that we'll use for the new columns we are creating.
- `sep`: the character(s) we want to use as the separator for our data. This can be `_` or `:` or whatever we come across.
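Here is a toy `separate()` call on invented underscore-joined strings, to show how these parameters fit together:

```r
library(tidyr)
library(magrittr)

toy.df <- data.frame(combined = c("Hospitals_General Hospital_Nurse",
                                  "Colleges_Algonquin College_Dean"))

# Split one character column into three on the "_" separator
toy.df %>%
  separate(col = combined,
           into = c("sector", "employer", "jobTitle"),
           sep = "_")
```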
As this can be a computationally intensive step, we’ll also be saving
this to sunshineLong.df using the compound assignment
operator %<>%.
# Start with our long-format data
sunshineLong.df %<>%
# separate our combinedData column
separate(., col = combinedData,
into = c("sector", "employer", "jobTitle", ..., "taxableBenefits"),
sep = "_")
## Error in separate(., col = combinedData, into = c("sector", "employer", : object 'sunshineLong.df' not found
# take a quick look at the structure
str(sunshineLong.df)
## Error in str(sunshineLong.df): object 'sunshineLong.df' not found
**Use the `stringr` package to remove unwanted characters**

Looking above at the structure of our data, we can see that our `salaryPaid` and `taxableBenefits` columns are of the `chr` datatype. You can probably sense that, intuitively, these should be numeric. We cannot, however, convert them directly; we must first remove characters like the "$" and "," that were placed in them.
We can use some simple verbs from the stringr package to
help us out. In the process we’ll use mutate() to alter
these same variables so we can save their updated state.
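On a toy column of dollar strings (made-up values), the pattern looks like this:

```r
library(dplyr)
library(stringr)

money.df <- data.frame(salary = c("$194,890.40", "$115,603.62"))

# Strip the "$" and "," characters, then convert to numeric
money.df %>%
  mutate(salary = as.double(str_remove_all(salary, "[$,]")))
```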
# Start with our long-format data
sunshineLong.df %<>%
# Mutate and update the values in salaryPaid and taxableBenefits
mutate(salaryPaid = str_remove_all(string = salaryPaid, pattern = ...),
taxableBenefits = str_remove_all(string = taxableBenefits, pattern = ...),) %>%
# Convert the updated variables to the correct data type
mutate(salaryPaid = ...(salaryPaid),
taxableBenefits = as.double(taxableBenefits),
calendarYear = as.integer(calendarYear),
numericID = ...(numericID))
## Error in mutate(., salaryPaid = str_remove_all(string = salaryPaid, pattern = ...), : object 'sunshineLong.df' not found
# take a quick look at the structure
str(sunshineLong.df)
## Error in str(sunshineLong.df): object 'sunshineLong.df' not found
pull() verb to retrieve a column as a vector
Sometimes when you want to quickly assess your data, it can be very
helpful to isolate a column to look at its contents. To keep with the
paradigm of piping our calls and keeping our code readable, I
suggest the pull() verb to help retrieve single variables
from your data frame. These are returned as a vector that you can
then pass along to functions like unique().
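On a toy data frame (illustrative values), the pull-then-inspect pattern looks like this:

```r
library(tidyverse)

toy.df <- tibble(sector = c("Hospitals", "Universities", "Hospitals"))

toy.df %>%
  pull(sector) %>%   # returns a character vector, not a data frame
  unique() %>%       # keep one copy of each value
  sort()             # alphabetical order for easy comparison
# "Hospitals" "Universities"
```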
Here we will retrieve the sector variable to see just
how many different sectors there are in our data.
# Pull the sector variable and look at its values
sunshineLong.df %>%
# Grab the sector data
pull(...) %>%
# Determine the unique values
... %>%
# Sort them for comparison
sort()
## Error in ...(.): could not find function "..."
stringr verbs
Looking at the output from the sector variable, there are
a total of 32 unique values. We can see there are some interesting
things worth cleaning up in these categories though:
| sector entry | Updated/combined sector |
|---|---|
| “Universities - Universités” | Universities |
| “Municipalities” | Municipalities And Services |
| “Legislative Assembly” | Legislative Assembly And Offices |
Note: Seconded is used to define individuals that have been laterally moved to other departments or sectors for a temporary period (kind of like being on loan). They are technically paid by their original employer BUT for the purposes of this dataset we’ll treat them like they are from the seconded department. Otherwise we would have to do a lot more data wrangling.
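The capture-group trick used below can be shown on a single made-up string: the raw-string pattern matches "Seconded (…)" and the "\\1" in the replacement refers back to whatever the parentheses captured.

```r
library(tidyverse)

# r"(...)" is R's raw-string syntax (R >= 4.0), so backslashes
# don't need to be doubled inside the pattern itself
str_replace_all("Seconded (Health)",
                pattern = r"(Seconded \((.*)\))",
                replacement = "Ministry: \\1")
# [1] "Ministry: Health"
```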
sunshineLong.df %<>%
# Replace the "Seconded (X)" sector entries with "Ministry: X", keeping what is within the parentheses
mutate(sector = str_replace_all(sector, pattern = r"(Seconded \((.*)\))", replacement = "Ministry: \\1")) %>%
# Remove any extra spaces or asterisks from these entries
mutate(sector = str_remove_all(sector, pattern = r"(\*|\s$)")) %>%
# Remove the "Government of Ontario" prefix
mutate(sector = str_remove_all(sector, pattern = r"(Government Of Ontario\s[-]\s)")) %>%
# Combine a few different sector categories into the same ones
# This is mostly just the result of inconsistencies from year to year
mutate(sector = str_replace_all(sector,
pattern = "Universities - Universités",
replacement = ...)) %>%
mutate(sector = str_replace_all(sector,
pattern = ...,
replacement = "Municipalities And Services")) %>%
mutate(sector = str_replace_all(sector,
pattern = "Legislative Assembly$",
replacement = "Legislative Assembly And Offices")) %>%
# Now that we've completed our changes, convert the variable to a factor
mutate(sector = as.factor(sector))
## Error in mutate(., sector = str_replace_all(sector, pattern = "Seconded \\((.*)\\)", : object 'sunshineLong.df' not found
# Take a peek at the results
str(sunshineLong.df)
## Error in str(sunshineLong.df): object 'sunshineLong.df' not found
rename() variables for clarity or simplicity
Looking at the output, we’ve whittled our sectors down from 32 to 29, which should make things easier when we start visualizing the data later. Just a couple more steps before we are done with our wrangling.
Next up we’ll rename our variables just a little by simplifying them
using the rename() verb from dplyr. There are
a number of ways you could accomplish this without using
dplyr but the simplicity of it is nice. The parameters here
follow the format of newColumnName = oldColumnName for each
column name we want to alter.
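As a quick sketch on a toy tibble (column names here are illustrative), each rename() argument follows newColumnName = oldColumnName:

```r
library(tidyverse)

toy.df <- tibble(calendarYear = 2023, salaryPaid = 110000)

toy.df %>%
  rename(year   = calendarYear,   # newName = oldName
         salary = salaryPaid)
# The columns are now named "year" and "salary"
```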
# Pass along our sunshine list to rename the columns
sunshineLong.df %>%
rename(... = calendarYear,
... = jobTitle,
... = salaryPaid) %>%
# Take a peek at the results
str()
## Error in rename(., ... = calendarYear, ... = jobTitle, ... = salaryPaid): object 'sunshineLong.df' not found
relocate()
The last cleanup we want to accomplish is to move salary
and taxableBenefits closer to the start of our data frame.
The reason for this is that these two columns represent actual data
points we are interested in while the others are more metadata that we
can use later on for sorting.
The relocate() verb from dplyr accomplishes
this with ease since we are not dropping or removing columns. It uses
some extra syntax to help accomplish its functions:
.data: the data frame or tibble we want to alter
...: the columns we wish to move
.before or .after: determines the destination of the columns. Supplying neither will move columns to the left-hand side.
In fact, relocate() can be used to rename a column as
well, but the renamed column will also be moved by default, so consider the
ramifications of such an action!
Note: We could accomplish a similar result using the
select() command as well. It’s really up to what you’re
comfortable with but it is much simpler to use relocate()
when you are working with a large number of columns and you want to move
one to a specific location.
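Both approaches can be sketched side by side on a toy tibble (illustrative columns only):

```r
library(tidyverse)

toy.df <- tibble(id = 1, sector = "Hospitals",
                 salary = 110000, benefits = 500)

# relocate(): move the measurement columns directly after id
toy.df %>%
  relocate(salary, benefits, .after = id)
# column order: id, salary, benefits, sector

# The same result with select(), using everything() to keep the rest
toy.df %>%
  select(id, salary, benefits, everything())
```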
We’ll save this final bit of wrangling into the variable
sunshineFinal.df.
# Save our result into a new variable
sunshineFinal.df <-
# Pass along our sunshine list to rename the columns
sunshineLong.df %>%
rename(year = calendarYear,
title = jobTitle,
salary = salaryPaid) %>%
# relocate the measurement data to the left
relocate(..., ..., .after = numericID)
## Error in sunshineLong.df %>% rename(year = calendarYear, title = jobTitle, : '...' used in an incorrect context
# Take a peek at the results
head(sunshineFinal.df)
## Error in head(sunshineFinal.df): object 'sunshineFinal.df' not found
# Make a quick copy of our final table
sunshineLong_copy.df <- sunshineLong.df
## Error in eval(expr, envir, enclos): object 'sunshineLong.df' not found
Comprehension Question 3.2.9: In the above example we used the relocate() function to move the “salary” and “taxableBenefits” column to near the start of our data frame. What other methods could we use to accomplish the same feat? Use the below code cell to help yourself out.
# comprehension answer code 3.2.9
# Relocate our target column using the select() command
# Use this copy of the sunshine list
sunshineLong_copy.df %>%
# Rename some of the variables
rename(year = calendarYear,
title = jobTitle,
salary = salaryPaid) %>%
# relocate "salary" and "taxableBenefits" to the right of "numericID"
... %>%
head()
## Error in ...(.): could not find function "..."
At this point we have completed the data wrangling we want to
accomplish on this dataset. We’ve converted it to a long-format and
renamed the Sectors entries while removing any NA values
that may cause issues. There are a number of ways we could save this
data now either as a text file or in its current form as a data frame in
a .RData format.
write_delim(), write_csv(), write_tsv(), write_excel_csv()
write_lines()
save()
load()
Let’s try some of those methods now.
# Check the files names we currently have
print(dir("./data/"))
## [1] "Sunshine_linePlot_facet.png" "sunshineList_subset_numID_wide.tsv"
## [3] "sunshineListLong.RData" "sunshineListLong.tsv"
# Write sunshineFinal.df to a tab-delimited file
...(sunshineFinal.df, file = "./data/sunshineListLong.tsv")
## Error in ...(sunshineFinal.df, file = "./data/sunshineListLong.tsv"): could not find function "..."
# Check our file names after writing
print(dir("./data/"))
## [1] "Sunshine_linePlot_facet.png" "sunshineList_subset_numID_wide.tsv"
## [3] "sunshineListLong.RData" "sunshineListLong.tsv"
# Save our data frame as an object
save(sunshineFinal.df, file="./data/sunshineListLong.RData")
## Error in save(sunshineFinal.df, file = "./data/sunshineListLong.RData"): object 'sunshineFinal.df' not found
# Check our file names after saving
print(dir("./data/"))
## [1] "Sunshine_linePlot_facet.png" "sunshineList_subset_numID_wide.tsv"
## [3] "sunshineListLong.RData" "sunshineListLong.tsv"
readxl and writexl packages for working with Excel spreadsheets
Not all of your data may come in a comma- or tab-delimited format. In
the case of Excel spreadsheets, there are packages available that
can facilitate the parsing of these more complex files. The
readxl package is part of the tidyverse, but the
writexl package is not. There are other means of writing to
an Excel file format, but they depend on other programs (like Java
or Excel) or their drivers.
From the readxl package:
excel_sheets()
read_excel()
From the writexl package (not a part of the tidyverse but independent of Java and Excel):
write_xlsx()
We now have our data in a tidy format - every row is an observation and every column is a variable. While we only have a few numeric data points available for summary, we can actually generate quite a few bits of summary information. We’ll do this initially using data summary tables by generating grouped data frames.
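A minimal round-trip sketch with these two packages, using a throwaway file in the temporary directory (the toy data frame is illustrative):

```r
library(readxl)
library(writexl)

toy.df <- data.frame(year = c(2022, 2023),
                     salary = c(105000, 110000))

# Write to a temporary .xlsx file, then read it back
path <- tempfile(fileext = ".xlsx")
write_xlsx(toy.df, path = path)

excel_sheets(path)  # list the sheet names in the workbook
read_excel(path)    # returns a tibble matching toy.df
```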
group_by() to implicitly subset your data
The simplest way to subset your data for analysis is with the
group_by() verb. You can specify which variables you’d like
to use and it will automatically generate any pre-existing groups that
meet your criteria. While it is not necessary for the variables you want
to subgroup by to be factor datatypes, it can simplify
things since you can quickly calculate how many maximum combinations
might exist.
Once the data is grouped, we can use summarise() to
create basic summaries on some of the variables. To help us focus, we’ll
try to answer a few simple questions:
We can approach this question by recognizing that we want to group our data by year to start with. We are interested in mean salary AND knowing which year had the highest mean salary. To break our analysis into steps we would:
# Pass along the data for grouping
sunshineFinal.df %>%
... %>%
summarise(total = ..., # Calculate group size
meanSalary = ...) %>% # Calculate mean salary for the group
arrange(desc(meanSalary)) # sort our data
## Error in arrange(., desc(meanSalary)): '...' used in an incorrect context
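The group-then-summarise pattern above can be sketched end-to-end on made-up data (the values are illustrative):

```r
library(tidyverse)

toy.df <- tibble(year   = c(2022, 2022, 2023, 2023),
                 salary = c(100000, 120000, 110000, 150000))

toy.df %>%
  group_by(year) %>%                 # one group per year
  summarise(total      = n(),        # group size
            meanSalary = mean(salary)) %>%
  arrange(desc(meanSalary))          # highest mean salary first
# 2023 comes first (mean 130000), then 2022 (mean 110000)
```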
So 2006 had our highest mean salary. What would it look like if we plotted this data?
We can’t answer these all right away but we can approach it with the same idea. Always start with a plan!
# Pass along the data for grouping
sunshineFinal.df %>%
# Group the data by year and sector
group_by(year, sector) %>%
summarise(total = n(), # Calculate group size
meanSalary = mean(salary), # Calculate mean salary for the group
totalSalary = ...) %>% # Calculate the total salary for the group
# regroup the data just by year
group_by(...) %>%
# Recalculate the max values in each group
summarise(maxGroupSize = ...(total),
maxMean = ...(meanSalary),
maxTotalSalary = ...(totalSalary))
## Error in summarise(., maxGroupSize = ...(total), maxMean = ...(meanSalary), : '...' used in an incorrect context
This one can seem a little tricky to work out but it’s really a variant of our previous question. Instead of grouping a second time by year, however, we can group by sector to analyse the historical data for each sector over our timespan.
# Pass along the data for grouping
sunshineFinal.df %>%
# Group the data by year and sector
group_by(year, sector) %>%
summarise(total = n()) %>% # Calculate group size
# Regroup by sector only
group_by(...) %>%
# Summarise based on each sector as a group
summarise(maxGroupSize = max(total),
meanGroupSize = mean(total),
stdevGroupSize = sd(total)) %>%
# Rearrange the data based on the biggest group size
arrange(desc(maxGroupSize))
## Error in summarise(., maxGroupSize = max(total), meanGroupSize = mean(total), : '...' used in an incorrect context
Well, it looks like School Boards have had the highest overall group size over the past 28 years; however, their yearly mean size isn’t quite as large as that of Municipalities and Services. What we are lacking is the ability to easily see trends over time, which could tell us, for instance, whether the number of School Board civil servants is growing or shrinking!
ggplot2
While we were able to quickly obtain some cursory information using a
group_by() and summarize() approach, it can be
hard to dig through the rows and rows of observations in our data. We
went to all that trouble to put our data into a tidy format, not just
for summarizing but also for ease of visualization! While we will go
into ggplot2 in much greater depth during lectures 2 and
3, let’s begin our journey now with a little bit of the
basics.
We can begin with some initial analyses of the data using the
ggplot2 package. It has all of the components we need to
help us decide on which data we want to focus on or keep. There are a
number of ways to visualize our data and here we will refresh our
ggplot skills.
Basic ggplot notes:
ggplot objects hold a complex set of attributes but always need an initial source of data
ggplot objects can be modified with the + symbol by adding in layers
ggplot objects can be plotted, saved, and passed around.
We’ll begin with a trimmed-down dataset where we filter out the “Ministry:” entries. Remember, these were originally the “Seconded” sectors and make up only a small part of our dataset. That’s right, it’s easy to filter your data on-the-fly before passing it on to ggplot!
As we start to produce plot figures, they’ll vary in size depending on your needs. In an R Markdown code cell, you can set your figure size using the code cell attributes much like the parameters of a function. You can set the figure size dimensions using fig.width and fig.height. As we proceed in the future, you’ll see us setting these attributes within our code cells.
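Those three notes can be demonstrated on a tiny made-up dataset (illustrative values): initialize with data, then add aesthetics and a geom with the + symbol.

```r
library(tidyverse)

toy.df <- tibble(year = rep(2020:2023, times = 2),
                 sector = rep(c("A", "B"), each = 4),
                 meanSalary = c(100, 105, 110, 120,
                                90, 95, 97, 99) * 1000)

# A ggplot object only needs an initial source of data...
p <- ggplot(toy.df)

# ...and layers are added with "+"; the object can be saved and reused
p + aes(x = year, y = meanSalary, colour = sector) +
  geom_line()
```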
# Initialize a plot with our summarized data
sunshine.plot <-
# Pass the original data
sunshineFinal.df %>%
# Filter out the "Ministry" datapoints
filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>%
# Group and summarise the data
group_by(year, sector) %>%
summarise(total = n(), # Calculate group size
meanSalary = mean(salary), # Calculate mean salary for the group
totalSalary = sum(salary)) %>% # Total salary spent on the group
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a quick look at the structure of the data
str(sunshine.plot)
## Error in str(sunshine.plot): object 'sunshine.plot' not found
We now have a basic plot object initialized but we need to tell it how to display the data associated with it. We’ll begin with a simple line graph of mean salary for all sectors across all dates within the set.
In order to update or add layers to a ggplot object, we
can use the + symbol for each command. For instance, to
define the source of x-axis and y-axis data, we use aes()
command to update the aesthetics layer. Remember how we defined the
sector variable as a factor? We’ll take advantage of that
here and tell ggplot to give each sector its own
colour.
After defining our aesthetics, we still need to tell
ggplot how to actually graph the data. The
ggplot package comes with an abundance of visualizations
accessed through the geom_*() commands. Some examples
include
geom_point() for scatterplots
geom_line() for line graphs
geom_boxplot() for boxplots
geom_violin() for violin plots
geom_bar() for bargraphs
geom_histogram() for histograms
# Update the aesthetics with axis and colour information, then add a line graph!
sunshine.plot +
# 2. Aesthetics
aes(x = ..., y = ..., colour = ...) +
theme(text = element_text(size = 20)) + # set text size
# Give titles to your axes
guides(colour = guide_legend(title="Sector")) + # Legend title
xlab("Year") + # Set the x-axis label
ylab("Mean Salary") + # Set the y-axis label
# 4. Geoms
geom_line()
## Error in eval(expr, envir, enclos): object 'sunshine.plot' not found
facet_wrap() command to break Sectors into separate graphs
There’s a lot of data on that graph and some of it is quite drowned
out because of the scale of some Sectors with much higher salaries. To
break out each sector individually, we can add the
facet_wrap() command. We’ll also update some of the
parameters:
scales: we will update this so each y-axis scale is determined by sector-specific data
ncol: use this to set the number of columns displayed in our grid
At the same time, we’ll also get rid of the legend since each individual graph will be labeled by its sector.
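A minimal sketch of faceting on made-up data (illustrative values; note how "free_y" lets each panel pick its own y-axis range):

```r
library(tidyverse)

toy.df <- tibble(year = rep(2020:2023, times = 2),
                 sector = rep(c("A", "B"), each = 4),
                 meanSalary = c(100, 105, 110, 120,
                                9, 10, 11, 12) * 1000)

ggplot(toy.df) +
  aes(x = year, y = meanSalary) +
  geom_line() +
  # One panel per sector, two columns, independent y-axis scales
  facet_wrap(~ sector, scales = "free_y", ncol = 2)
```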
# Add a facet_wrap and get rid of the legend
sunshine_facet.plot <- sunshine.plot +
# 2. Aesthetics
aes(x = year, y = meanSalary, colour = sector) +
theme(text = element_text(size = 14)) + # set text size
# Give titles to your axes
xlab("Year") + # Set the x-axis label
ylab("Mean Salary") + # Set the y-axis label
ggtitle("Mean sunshine salary per year across sectors") +
# Remove the legend
theme(legend.position = "none") +
# 4. Geoms
geom_line() +
# 7. Facet our data by sector
facet_wrap(~ ..., scales = ..., ncol=...)
## Error in eval(expr, envir, enclos): object 'sunshine.plot' not found
# Display our plot
sunshine_facet.plot
## Error in eval(expr, envir, enclos): object 'sunshine_facet.plot' not found
ggsave() command to save your plots to a file
There are a number of ways you can use the ggsave()
command to specify how you want to save your files.
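As a sketch, ggsave() defaults to the last plot displayed, or you can pass a plot object explicitly; here we write a toy plot to the temporary directory so no real files are touched:

```r
library(tidyverse)

p <- ggplot(tibble(x = 1:3, y = c(2, 4, 6))) +
  aes(x = x, y = y) +
  geom_line()

# Save an explicit plot object; device and dimensions are optional
ggsave(plot = p,
       filename = file.path(tempdir(), "toy_plot.png"),
       device = "png",
       units = "cm", width = 10, height = 8)
```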
# What is our working directory?
getwd()
## [1] "C:/Users/mokca/Dropbox/!CAGEF/Course_Materials/Advanced_Graphics_in_R/2025.03_Adv_Graphics_R/Lecture_01_R_Introduction"
# Save the plot we've generated to the root directory of the lecture files.
ggsave(...,
filename = "data/Sunshine_linePlot_facet.png",
scale=2,
device = "png",
units = c("cm"), width = 20, height = 30)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a look at the directory
dir("data/")
## [1] "Sunshine_linePlot_facet.png" "sunshineList_subset_numID_wide.tsv"
## [3] "sunshineListLong.RData" "sunshineListLong.tsv"
Although we do have a running total for each year, what if we want to look at the total number of individuals across our sectors? Using a barplot, we can stack sectors by year and get a sense of the yearly total individuals by sector.
This time we will use geom_bar() to display our data and
tell it to use the values from our total variable in our
data to generate the totals. We do this by setting the
stat = "identity" parameter.
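The stat = "identity" idea can be sketched on toy data (illustrative values): the y values are used as-is instead of counting rows, and bars sharing an x value are stacked by default.

```r
library(tidyverse)

toy.df <- tibble(year   = c(2022, 2022, 2023, 2023),
                 sector = c("A", "B", "A", "B"),
                 total  = c(10, 20, 15, 25))

ggplot(toy.df) +
  aes(x = year, y = total, fill = sector) +
  # Use the "total" values directly; sectors stack within each year
  geom_bar(stat = "identity")
```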
sunshine.plot +
# 2. Aesthetics
aes(x = year, y= total, fill = ...) + # set our fill colour instead of line colour
theme(text = element_text(size = 14)) + # set text size
guides(fill = guide_legend(title="Sector")) +
# Give titles to your axes
xlab("Year") + # Set the x-axis label
ylab("Total Individuals") + # Set the y-axis label
ggtitle("Yearly breakdown of servants by sector") +
# Set up our barplot here
geom_bar(...)
## Error in eval(expr, envir, enclos): object 'sunshine.plot' not found
Looks like our number of public servants with salaries above $100K is rising year-by-year! That should be a good thing! Going back to our third question from section 4.1.3, we can see visually that the School Boards sector in recent years has had the most employees on the Sunshine list, but in the years before that, Municipalities and Services tended to be the larger group.
Returning back to our question, how does total salary payout look
between our various sectors? We can quickly change up our graph
parameters so that we are viewing totalSalary instead. We
just need to set our y-axis properly.
sunshine.plot +
# 2. Aesthetics
aes(x = year, y= ..., fill = sector) + # set our fill colour instead of line colour
theme(text = element_text(size = 14)) + # set text size
guides(fill = guide_legend(title="Sector")) +
# Give titles to your axes
xlab("Year") + # Set the x-axis label
ylab("Total Salary Paid") + # Set the y-axis label
ggtitle("Yearly breakdown of total salary paid by sector") +
# Set up our barplot here
geom_bar(stat = "identity")
## Error in eval(expr, envir, enclos): object 'sunshine.plot' not found
It looks nearly identical to our breakdown of size. This is actually pretty good to see, as it suggests that salaries in these groups tend to be very similar. You would need to do more in-depth analyses BUT we can leave that for your assignment.
It would also be good to determine more clearly, however, what percentage of each year the various sectors comprise but we’ll save that for next week.
geom_point()
Before we wrap up, let’s take a closer look at our data by zooming in on a single year. We’ll filter our data down to 2023 and then plot all of the salaries as individual datapoints, categorized by sector.
Using the geom_point() layer, we’ll be able to plot each
observation in our dataset. The resulting visualization would be
considered a strip-plot rather than a standard scatterplot or
biplot.
Note that we are also accessing the theme() layer here
to adjust parts of our plot. We’ll spend most of lecture
03 learning to manipulate many of our smaller details. For now,
we’ll adjust our x-axis text so it is at a 45-degree angle.
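A strip-plot sketch on made-up data (illustrative values), including the rotated x-axis labels via the theme() layer:

```r
library(tidyverse)

toy.df <- tibble(sector = rep(c("A", "B"), each = 5),
                 salary = c(100, 110, 105, 120, 130,
                            150, 160, 140, 155, 165) * 1000)

ggplot(toy.df) +
  aes(x = sector, y = salary, colour = sector) +
  theme(legend.position = "none",   # the x-axis already names each sector
        # rotate x-axis labels 45 degrees and anchor them to the tick
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  geom_point()   # one point per observation: a strip plot
```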
sunshineFinal.df %>%
# Filter the data by year
filter(year == ...,
str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>%
ggplot() +
# 2. Aesthetics
aes(x = sector, y = salary, colour = sector) +
# Remove the legend
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + # rotate our x-axis text to 45 degrees
# Give titles to your axes
xlab("Sector") + # Set the x-axis label
ylab("Salary Paid") + # Set the y-axis label
ggtitle("2023 breakdown of salaries paid by sector") +
# 4. Geoms
...
## Error in filter(., year == ..., str_detect(string = sector, pattern = "Ministry:", : object 'sunshineFinal.df' not found
Wow, some folks at Ontario Power Generation are making a LOT of money! That’s a lot of pay for steering a company that doesn’t have many competitors within the province! From our figure we get a sense of the pay range in each sector, although we can’t properly see the full distribution of our sectors. We’ll work on that in the coming weeks.
That’s our first class! If we’ve made it this far, we’ve reviewed ggplot2.
We took a “messy” dataset from the Ontario government and created a tidy dataset that we were able to visualize. We also took the time to summarize our data based on specific groups to get a better picture of how salaries are distributed across sectors and over time.
Next week? Getting deeper into ggplot2!
This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | | |
|---|---|---|
| Code | 50% | - Does it follow best practices? |
| | | - Does it make good use of available packages? |
| | | - Was data prepared properly? |
| Answers and Output | 50% | - Is output based on the correct dataset? |
| | | - Are groupings appropriate? |
| | | - Are titles/axes/legends correct? |
| | | - Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.
You can save and download the markdown notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: March 14th, 2024.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 3.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2025 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
lubridate package: https://r4ds.had.co.nz/dates-and-times.html
As of 2025-03-01, the latest stable R version is 4.4.3:
Windows:
- Go to http://cran.utstat.utoronto.ca/
- Click on ‘Download R for Windows’
- Click on ‘install R for the first time’
- Click on ‘Download R 4.4.3 for Windows’ (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the
instructions.
(Mac) OS X:
- Go to http://cran.utstat.utoronto.ca/
- Click on ‘Download R for (Mac) OS X’
- Click on R-4.4.3 .pkg (or a newer version)
- Open the .pkg file once it has downloaded and follow the
instructions.
Linux:
- Open a terminal (Ctrl + Alt + T)
- sudo apt-get update
- sudo apt-get install r-base
- sudo apt-get install r-base-dev (so you can compile packages from
source)
As of 2025-03-05, the latest RStudio version is 2024.12.1+563 (released 2025-02-13)
Windows (10/11):
- Go to https://posit.co/downloads/
- Click on ‘RSTUDIO-2024.12.1-563.EXE’ to download the installer (or a
newer version)
- Double-click on the .exe file once it has downloaded and follow the
instructions.
(Mac) OS X (11+):
- Go to https://posit.co/downloads/
- Click on ‘RSTUDIO-2024.12.1-563.DMG’ to download the installer (or a
newer version)
- Double-click on the .dmg file once it has downloaded and follow the
instructions.
Linux:
- Go to https://posit.co/downloads/
- Click on the installer that describes your Linux distribution,
e.g. ‘RSTUDIO-2022.12.0-353-AMD64.DEB’ (or a newer version)
- Double-click on the .deb file once it has downloaded and follow the
instructions.
- If double-clicking on your .deb file did not open the software
manager, open the terminal (Ctrl + alt + t) and type sudo dpkg
-i /path/to/installer/RSTUDIO-2024.12.1-563-AMD64.deb
_Note: You have 3 things that could change in this last command._
1. This assumes you have just opened the terminal and are in your home directory. (If not, you have to modify your path. You can get to your home directory by typing cd ~.)
2. This assumes you have downloaded the .deb file to Downloads. (If you downloaded the file somewhere else, you have to change the path to the file, or download the .deb file to Downloads).
3. This assumes your file name for .deb is the same as above. (Put the name matching the .deb file you downloaded).
If you have a problem with installing R or RStudio, you can also try to solve the problem yourself by Googling any error messages you get. You can also try to get in touch with me or the course TAs.
RStudio is an IDE (Integrated Development Environment) for R that
provides a more user-friendly experience than using R in a terminal
setting. It has 4 main areas or panes, which you can customize to some
extent under
Tools > Global Options > Pane Layout:
All of the panes can be minimized or maximized using the large and small box outlines in the top right of each pane.
The Source is where you are keeping the code and annotation that you want to be saved as your script. The tab at the top left of the pane has your script name (e.g. ‘Untitled.R’), and you can switch between scripts by toggling the tabs. You can save, search or publish your source code using the buttons along the pane header. Code in the Source pane is not run automatically; you must execute it yourself.
To run your current line of code or a highlighted segment of code
from the Source pane you can:
a) click the button
'Run' -> 'Run Selected Line(s)',
b) click 'Code' -> 'Run Selected Line(s)' from the menu
bar,
c) use the keyboard shortcut CTRL + ENTER (Windows &
Linux) Command + ENTER (Mac) (recommended),
d) copy and paste your code into the Console and hit Enter
(not recommended).
There are always many ways to do things in R, but the fastest way will always be the option that keeps your hands on the keyboard.
You can also type and execute your code (by hitting
ENTER) in the Console when the
> prompt is visible. If you enter code and you see a
+ instead of a prompt, R doesn’t think you are finished
entering code (i.e. you might be missing a bracket). If this isn’t
immediately fixable, you can hit Esc twice to get back to
your prompt. Using the up and down arrow keys, you can find previous
commands in the Console if you want to rerun code or fix an error
resulting from a typo.
On the Console tab in the top left of that pane is your current working directory. Pressing the arrow next to your working directory will open your current folder in the Files pane. If you find your Console is getting too cluttered, selecting the broom icon in that pane will clear it for you. The Console also shows information: upon start up about R (such as version number), during the installation of packages, when there are warnings, and when there are errors.
In the Global Environment you can see all of the stored objects you have created or sourced (imported from another script). The Global Environment can become cluttered, so it also has a broom button to clear its workspace.
Objects are made by using the assignment operator
<-. On the left side of the arrow, you have the name of
your object. On the right side you have what you are assigning to that
object. In this sense, you can think of an object as a container. The
container holds the values given as well as information about ‘class’
and ‘methods’ (which we will come back to).
Type x <- c(2,4) in the Console followed by
Enter. 1D objects’ data types can be seen immediately as
well as their first few values. Now type
y <- data.frame(numbers = c(1,2,3), letters = c("a","b","c"))
in the Console followed by Enter. You can immediately see
the dimension of 2D objects, and you can check the structure of data
frames and lists (more later) by clicking on the object’s arrow.
Clicking on the object name will open the object to view in a new tab.
Custom functions created in session or sourced will also appear in this
pane.
The Environment pane dropdown displays all of the currently loaded packages in addition to the Global Environment. Loaded means that all of the tools/functions in the package are available for use. R comes with a number of packages pre-loaded (e.g. base, grDevices).
In the History tab are all of the commands you have executed in the Console during your session. You can select a line of code and send it to the Source or Console.
The Connections tab is to connect to data sources such as Spark and will not be used in this lesson.
The Files tab allows you to search through directories; you can go to
or set your working directory by making the appropriate selection under
the More (blue gear) drop-down menu. The ...
to the top left of the pane allows you to search for a folder in a more
traditional manner.
The Plots tab is where plots you make in a .R script will appear (notebooks and markdown plots will be shown in the Source pane). There is the option to Export and save these plots manually.
The Packages tab has all of the packages that are installed and their versions, and buttons to Install or Update packages. A check mark in the box next to the package means that the package is loaded. You can load a package by adding a check mark next to a package, however it is good practice to instead load the package in your script to aid in reproducibility.
The Help tab has the documentation for all packages and functions. For each function you will find a description of what the function does, the arguments it takes, what the function does to the inputs (details), what it outputs, and an example. Some of the help documentation is difficult to read or less than comprehensive, in which case googling the function is a good idea.
The Viewer will display vignettes, or local web content such as a Shiny app, interactive graphs, or a rendered html document.
I suggest you take a look at Tools -> Global Options
to customize your experience.
For example, under Code -> Editing I have selected
Soft-wrap R source files followed by Apply so
that my text will wrap by itself when I am typing and not create a long
line of text.
You may also want to change the Appearance of your code.
I like the RStudio theme: Modern and
Editor font: Ubuntu Mono, but pick whatever you like!
Again, you need to hit Apply to make changes.
That whirlwind tour isn’t everything the IDE can do, but it is enough to get started.
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.